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Abstract 



Optimal prediction methods compensate for a lack of resolution in the nu- 
merical solution of time-dependent differential equations through the use of prior 
statistical information. We present a new derivation of the basic methodology, 
show that field-theoretical perturbation theory provides a useful device for dealing 
with quasi-linear problems, and provide a nonlinear example that illuminates the 
difference between a pseudo-spectral method and an optimal prediction method 
with Fourier kernels. Along the way, we explain the differences and similarities 
between optimal prediction, the representer method in data assimilation, and du- 
ality methods for finding weak solutions. We also discuss the conditions under 
which a simple implementation of the optimal prediction method can be expected 
to perform well. 
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1 Introduction 

We consider a Cauchy problem 



where R{x,u,Ux...) is a (generally nonlinear) function of m = u{x,t), of its spatial 
derivatives, and of the independent variable x, in any number of dimensions; subscripts 
denote differentiation. We assume that we cannot afford to use enough computational 
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elements (for example, mesh points) to resolve the problem adequately, or that we do 
not wish to use many computational elements because we are only interested in some 
of the features of the solutions and it seems to be wasteful to compute all the details. 
We also assume that we have some prior statistical information about the distribution 
of possible solutions. The question we address is how the prior statistics can be used to 
obviate the need for resolution. In contrast to problems that arise in some applications, 
especially in geophysics fl]], we assume that the equations we are solving are fully known, 
and that the data are knowable in principle, even if we may not be able or willing to 
store them all in a computer's memory. 

In the present paper we assume that our statistical information consists of an invari- 
ant measure /i on the space of solutions, i.e., we assume that the initial data are sampled 
from a probability distribution on the space of data, and that, in principle, if one takes 
one instance of initial data after the other and computes the solutions produced by ( p..lD 
at any later time t, then the set of solutions obtained at that later time t (viewed as 
functions of the spatial variables) has the same probability distribution as the set of 
initial data. We assume further that this probability distribution is explicitly known. In 
section 4 below we give an example where these assumptions are satisfied and provide 
more precise definitions. Our present assumptions may be unnecessarily strong for some 
practical problems, but they simplify the exposition and it is often easy to weaken them. 
We shall call the invariant measure the prior measure; thus the prior measure is the dis- 
tribution of the data before anything has been specified about a particular problem. In 
Hamiltonian systems a natural prior measure is the canonical measure induced by the 
Hamiltonian 7i, i.e., a measure defined by the probability density 

where Z is a normalization constant and the parameter T, which determines the variance 
of the density, is known as the "temperature". We do not assume that the differential 
equation ( |1 . 1|) admits a unique invariant measure; in cases of non-uniqueness the right 
choice of measure is part of the formulation of the problem. 

We further assume that all we know, or care to know, at time t = 0, is a small set 
of data. In the present paper we choose these data to be of the form 

{ga,u) = j ga{x)u{x,0) dx = Va, a = l,...,N, (1.2) 

where the Qa = ga{x) are suitably chosen kernels (the choice of kernels Qa is at our 
disposal). The question we are asking can now be rephrased as follows: How do we best 
predict the future using only variables defined as in ( |1.2| ), given the prior measure /i? 
At time t = one can, at least in principle, find the mean solution conditioned by 



the data (]_^), for every point x, i.e., at each x one can average over all those functions 
in the support of the invariant measure that also satisfy the conditions (|1.2|) and find 
the mean v of u given the values Va, a = 1, ...,N; symbolically, 

v{x)=E[u{x)\V,,...,Vr,], (1.3) 

where E[-] denotes an expectation value. This is a regression problem, which can be 
solved by standard tools ([^, see also below). If one is given only the information 
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contained in (|1.2|), this regression is the right substitute for a more detailed knowledge 
of the data. To perform the regression at a later time t we need appropriate conditions 
at that later time, which should encode the later effect of the initial data ( p..2D . The 
problem now at hand is how to find these conditions and how to do it efficiently, in 
linear and in nonlinear problems. 

We have already explained our basic approach in [H, H |[. In the present paper we 
explain it in a different way which we hope is more transparent; the suggestion that a 
weak formulation is the right starting point is due to Gottlieb [|1T| . The paper has several 
goals: to explain what constitutes the novelty of our approach, to show how it meshes 
with field-theoretical perturbation methods, to give a simple, explicit, nonlinear example 
of the ways in which the basic algorithm differs from, and is superior to, a numerical 
method that uses the information in the initial data without taking advantage of the 
prior statistics, and to discuss the domain of applicability of the current implementation 
of the basic ideas. Our examples are of Schrodinger type. 



Other work along somewhat analogous lines includes Scotti and Meneveau [24|, where 



a "fractal" interpolation can be viewed as an analog of sampling a measure, and a stochas- 



tic construction by Vaillant |25] 



A comment on notations: The notation in equation ( |1.3|) is clear but cumbersome. 
A shorter version is: f = -E[-u|V"]. We shall also use the simpler but less transparent 
physicists' notations: (u) for E[u] and {u)y for £'[m|V^]. Both notations {u)y, E[u\V] 
are of course shorthands for {u)y^g^ etc., since the conditioning ( |1.2| ) depends on the 
kernels as well as on the right-hand sides. 

The paper is organized as follows: In section 2 we present an overview of optimal 
prediction. In sections 3, 4, we present some needed background material on regression 
and invariant measures. In sections 5, 6 we explain how perturbation theory can be used 
to implement optimal prediction in nonlinear problems, and provide examples of success 
and failure in problems with sparse data. In section 7 we provide a detailed analysis 
of a problem in which perturbation theory is carried only to zero-th order (but with 
a ground state that our previous work shows to be optimal), and in which the kernels 
are trigonometric functions. These simplifications yield results that are particularly 
transparent. Conclusions are drawn in a final section. 



2 Weak solutions, regression, and prediction 

We start with well-known considerations about weak solutions of linear equations; they 
lead naturally to an algorithm for solving underresolved linear equations, which will be 
useful below, in particular because we shall be able to present our proposals by way of 
a contrast. Consider the equation: 



ut = Lu, (2.1) 

where L is a linear operator. Multiply ( p.l| ) by a smooth test function g; for simplicity 
assume that the boundary conditions on u are periodic, and thus g can be also assumed 



Optimal prediction for Hamiltonian partial differential equations 4 



periodic. Integration over a periodic domain in x and between t = and t = t, followed 
by an integration by parts, results in 

/ [ {gt + L^g) udxdt+ (^(r), (^(r)) - {u{0),g{0)) = 0, (2.2) 

Jt=0 Jx 

where {u{t),g{t)) denotes J u{x,t)g{x,t) dx and L'^ is the adjoint of L. A weak solution 
of (|2.1|) is a function u{x,t) that satisfies (|2.2| ) for all test functions g (see e.g.[|^). In 
particular, if the test functions ga satisfy the adjoint equation 

^ + L^g^ = 0, a = l,...,N, (2.3) 

then ti is a weak solution of (|2.1|) provided {u, ga) is a constant independent of t for each 
a. This observation produces a possible numerical method for solving ( |2.1| ): One can 
construct a collection of functions ga that satisfy the adjoint equation ( ^^ , find the 
numerical values Va of the inner products {u, ga), {g = g{x,0)), u being the initial data, 
and finally reconstruct the weak solution at a later time r from its inner products with 
the functions g{x,T). These functions, g{x,t), can be found at time t = r if they are 
known at t = 0. 

If this method is used with a small number of functions (i.e., small A^), the solution 
at a time t > will be underdetermined. In this case one can use the invariant measure 
fi to "fill in" the gaps through regression, i.e., replace the weak solution which is not 
completely known by its average (as determined by fi) over all solutions that satisfy 
the conditions {ga,u) = Va- In other words, replace the function u which is not 
completely known by the regression v = v{x,t) = {u)y. Some technical background on 
regression follows in 

Note that so far, the construction resembles what is quite commonly done in un- 
derdetermined linear problems (for example, in the context of data assimilation [|l|; the 
function ga are analogous to the "representers" which are used there). The construc- 
tion is computationally useful in certain linear problems even when alternate ways of 
finding future regressions are available, as happens whenever time evolution and aver- 
aging commute (see 0). However, the construction just presented is restricted to linear 
problems, and the amount of work needed to evaluate and store the kernel functions ga 
may not be trivial. It is also clear that not all of the available information has been 
used, as the evolution of the kernels ga is independent of the invariant measure fi, and 
the measure used in the regression need not be connected with the differential equation. 
Indeed, in geophysical applications the measure is chosen according to considerations 
quite extraneous to the differential equations which may be only partly known. 

We now wish to show how the machinery can be modified through the use of the 
invariant measure in the evolution equations for the conditions, so that it becomes more 
efficient as well as generalizable to nonlinear problems. 

First, we must view the calculations differently. The measure /i induces a measure on 
the space of the data u that satisfy the initial conditions: the conditional measure fiy 
The conditional measure is not invariant] for example, if the initial conditions consist 
of A^ point-values of the functions u{x, 0), i.e., if we assume that at t = the functions 
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satisfy u{xa,t = 0) = Va (the functions ga are then 6 functions), there is no reason to 
beheve that these conditions will be satisfied at all subsequent times with the same Va 
by the solutions of the differential equations that arise out of these initial data. 

Imagine first that one can sample the initial conditional measure fiy] find the (in 
general, weak) solution of equation ( |2.1D that has this datum as initial condition, and 
perform this procedure repeatedly. At time t this produces an ensemble of functions 
u{x,t) which inherits a measure from the initial data. In principle, this measure is 
well-determined; we wish to determine it in practice, and then to average with respect 
to this measure so as to obtain what we call an optimal prediction. Note that if the 
temporal evolution governed by ( |1.1| ) is ergodic with respect to the invariant measure, 
the conditional measure will eventually relax to the invariant measure and the initial 
conditions will be forgotten; we are in fact dealing with a computational analog of 
non-equilibrium statistical mechanics. 

Given the initial conditional measure, we can find the statistics of Lu, or, in the 
general case ( [1.1| ), the statistics of R{u) and consequently the statistics of Ut; thus, the 
evolution of the measure fiy can be determined for a short time interval At. We cannot 
go beyond a short time interval because the measure fiy at time At can no longer be 
described as the invariant measure conditioned by the conditions ( [L.2D , at least not with 
the same functions Qa and the same Va. This leads us to the closure assumption: 

Assumption 2.1 (Closure) The conditional probability measure at time t, i^vif), can 
be approximated by 

Hv{t) = fiv{t), (2.4) 



where the left hand side Hvit) is the measure conditioned by the initial data while 
the right hand side is the invariant measure conditioned by N affine conditions of the 
form 

{ga{t),u{t)) = j gaix, t)u{x, t) dx = Vait). 

The kernels ga{t) and the values Va{t) of the inner products will generally be different 
at time t than at time t = 0, but it is assumed here that the afiine form of the conditions 
and their number remain constant. We have already seen that in the linear case equation 
( p.4|) is in fact a theorem (a different analysis of this fact was given in [^). 

It should be noted that even though this is the assumption that will be used in the 
present paper, it may be unduly restrictive in other situations; there is nothing magical 
about keeping the number of conditions fixed, and the conditions need not in general 
be affine. 

What we wish to do is to advance the measure to time At and then to find the 
conditions that produce that new measure at the new time, when these conditions 
condition the invariant measure. In that way, the whole process can be taken a step 
further and then repeated as often as one may wish. In the linear case we have already 
produced a recipe for updating the conditions: there, we let ga satisfy the adjoint 
equation and keep the numbers Va fixed. We now propose different, approximate, ways 
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of finding conditions that describe the evolving measure. We are going to do so by 
matching moments; on one hand, we will calculate moments of the conditional measure 
by regression from the old moments, and on the other hand, we will produce conditions 
that produce the new moments from the invariant measure; this will produce equations 
of motion for the conditions. 

More specifically, suppose that we compute Nq moments of m, at time At. For 
example, set q = 2, and compute the means and the variances with respect to the 
conditioned measure of the random variables u{x,At) at each of the points Xa- We 
can do this knowing the conditions at time t = and the invariant measure. On the 
other hand, suppose we let the functions ga depend on g — 1 parameters; for example, 
if g = 2, then we can pick ga{x) = exp((x — Xq,)^/^^), where the "centers" Xa are fixed 
and the numbers may be allowed to vary in time and will serve as our parameters. 
If we write down the requirement that the moments we calculated at time At match 
the moments produced by conditioning the invariant measure by affine conditions with 
unknown values of parameters such as the cTq and unknown values of the right-hand 
sides Va, we obtain Nq algebraic equations for the parameters, which we can try to 
solve. If we solve these equations successfully, we obtain a set of simultaneous ordinary 
differential equations for the parameters and the moments. 

Before carrying out such a calculation, two remarks: There is no a priori guarantee 
that the algebraic equations we will obtain can be always solved: The basic assumption 
may fail, and the choice of parameters may be unsuitable. A good numerical program 
will inform us that a solution cannot be found. However, if a solution is found, the re- 
sulting moments are realizable. It is well-known that closures may well produce moments 
that not only fail to solve the problem at hand but do not solve any problem, because 
there is no stochastic process that admits the computed moments as its moments (see 
e.g.p2|). This is often a major difficulty in the formulation of mean equations, and it 
does not arise here. 

We limit ourselves here to the simplest case with q = 1, i.e., for each a,a = 1, N 
we keep track of a single quantity, which we choose to be the mean value of the inner 
product of ga and u, {{ga,u))v = ida, {'^)v)^ while at the same time we modify a single 
parameter in each condition; that single parameter is chosen to be the value Va of the 
a-th product. Thus we must have for a = 1, A^, 

^Va = {iga,Ut))y = {iga,Riu)))y = {g a , {R{u)) y) , 

where R{u) is the right-hand side of equation ( |1.1| ). The final equation, 

^ = (^?„,(i?(«)V), a = l...Ar, (2.5) 

is the main equation used in the present paper. The more transparent mathematical 
notation, 

E [{ga, R{u))\Vi, . . . , Viv] , a = 1 ... AT, 



dt 

makes explicit the fact that we are dealing with a system of A^ ordinary differential 
equations. Once the Va are found at time t, a regression can be used to find the 
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average solution at any point x (see We shall call an algorithm that uses, in the 
matching of moments and parameters, only means of the unknown solutions and no 
higher moments, a first-order prediction scheme; in the present paper this is the only 
prediction scheme we shall use. We hope to demonstrate that a first-order prediction is 
often an improvement over algorithms that take no cognizance of the invariant measure, 
in the sense that it requires less computational labor than the alternatives; higher-order 
and more sophisticated optimal prediction schemes will be described in subsequent work. 

It is useful to contrast our algorithm with the one at the beginning of the section: 
We are keeping the kernels Qa fixed while changing the values Va of the conditions, while 
the "natural", linear construction at the beginning of the section did the opposite. 

When can we expect the first order optimal prediction scheme to be accurate? In a 
linear problem Ut = Lu, if the kernels Qa are eigenfunctions of the operator L"'" adjoint 
to L, then one can readily see from the analysis at the beginning of the section that 
equations ( |2.5| ) are exact, and this remains true if the Qa span an invariant subspace 
of + = 0. Lowest-order optimal prediction should be accurate as long as the 
Qa span a space that is approximately invariant under the flow induced by (see 



TT|],[0). This remark also provides a recipe for choosing the kernels. Something similar 



remains true in nonlinear problems 0, [0,p6[: Define the space of functions spanned 
by the kernels ga to be the resolved part of the solution; equations (p.5|) should yield an 
accurate prediction of this resolved part (including a correct accounting for the effect of 
unresolved components on the resolved components) as long as there is no substantial 
transfer of information from the resolved part to the unresolved part and back; if such 
information transfer should occur, a correct description of the flow should include an 
additional, "memory", non-Markovian term. This remark also points out that equations 
( ^.51 ) should not be accurate for very long times (because memory terms are important 
to the description of decay to equilibrium), and should be better at low temperatures T 
than at high temperatures (because the decay to equilibrium should be more rapid at 
high T). 

There are other ways to ensure the accuracy of a flrst order optimal prediction 
scheme: The quantities {ga,R{u)) are, of course, random variables whose distribution 
depends on the measure nv- If the standard deviations of these variables are small, then 
equations ( |2.5| ) are be good approximations to the exact solution. Indeed, equation ( [^.5[ ) 
merely equates these random variables to their means; a higher order approximation 
would take into account the variance of these variables as well. The smaller the variance 
of the variables {ga, R{u)), the smaller the error we expect; to some extent, we can 
control this variance by choosing the kernels appropriately. The larger the support of 
the kernels, the more these variables represent spatial averages, and the slower we may 
expect their variance to grow; thus if the scheme is to be accurate, it is most likely that 
the functions g^ should not be narrow, (5-function-like objects. Furthermore, the kernels 
should not have disjoint supports (for an analogous observation in the theory of vortex 
methods, see, e.g., B). Rigorous error bounds for the linear case can be found in 



These conditions allow us to call the variables {ga,u) that appear in ( p..2| ) collective 
variables; they are groupings of variables. Note that as the number of collective 
variables increases, optimal prediction homes in an ever smaller set of initial data, and 
the variance of the variables {ga,R{u)) should decrease. 
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Finally, the presentation above started from a discussion of weak solutions, and 
indeed in all the examples below the solutions will be weak; why is that so? We shall 
present a detailed mathematical analysis elsewhere; here it should suffice to comment 
that if the solutions are not highly oscillatory on several levels, there is less interest in 
analyzing methods that fail to resolve them; solutions that do oscillate significantly on 
several scales appear, on the largest scale, as non-smooth and therefore weak. 



3 Regression for Gaussian variables 

It was mentioned in the previous section that once the data that condition the state 
of the system are found, or, at time t = once the initial data have been chosen, 
the remainder of the solution can be replaced by a regression. Formula ( |2.5| ), the first 
order optimal prediction formula, is also a regression formula (an average conditioned 
by partial information). To illustrate regression, and more importantly, to remind the 
reader of formulas that will be used in the sequel, we collect in the present section some 
regression formulas for Gaussian measures, both for the discrete and the continuous 
case. More details can be found in standard books (e.g. ||21| as well as in p). 



We start by describing how to perform regressions on discrete sets of Gaussian (nor- 
mal) variables. Let u = {ui, . . . , u„) be a real vector of jointly normal random variables; 
it has a probability density /(u) of the form, 

P(Si < Ui < Si + dSi, . . . , Sn < Un < Sn + dSn) = /(s) dSi . . . dSn = 

= Z^^ exp + b • dsi . . . dSn, (3.1) 

where Z is the appropriate normalization factor, s = (si, . . . , s„), and the n x n matrix 
A with entries a^j is symmetric, positive definite, and has an inverse A^^ . The matrix 
A~^ is the pairwise covariance matrix with elements 

a~j^ = Cov {ui,Uj} = (uiUj) - (ui) (uj) , 

where the brackets, (■), denote averaging with respect to the probability density; the vec- 
tor b with components bi is related to the expectation values of u, (u) = ((mi) , • • • , {un)), 
by 

A'^h = (u) . 

The distribution is fully determined by the n means and by the |n(n -|- 1) indepen- 
dent elements of the covariance matrix, and therefore all the expectation value of any 
observable can be expressed in terms of these parameters. 

Next, we assume that the random vector u satisfies a set of conditions of the affine 
form, 

gaiUi = 14, a = 1, . . . , < n, (3.2) 



where the index a enumerates the conditions and summation over repeated indices 
is assumed. Each equation in (|3.2| ) is a discrete analog of one of the equations in 
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( |1.2| ). The N X n matrix G, whose entries are gai, determines the full set of conditions. 
To distinguish between the random variables . . . and the collective variables 

{Vi, . . . , Vjsf), we enumerate the former by Roman indices and the latter by Greek indices. 

Our goal is to compute regressions, for various functions (p, i.e., condi- 

tional expectation values, or equivalently, averages over the functions that satisfy the 
conditions. We state three lemmas that will become handy below; for proofs, see 0. 

Lemma 3.1 The conditional expectation of the variables Ui is an affine function of the 
conditioning data V^: 

{Ui)y = QiaVa + Q, (3.3) 

where the n x N matrix Q whose entries are the qia and the n-vector c whose entries 
are the Ci are given by: 

Q = {A-'G^){GA-^G^)-\ 

c = {A~^G^){GA-^G^)-\GA-^h), (3.4) 

where the dagger denotes a transpose. 

Lemma 3.2 The conditional covariance matrix has entries 

COV{Ui,Uj}y = {UiUj)y - {Ui)y {Uj)y = 

= [A-' - iA-'G^)iGA-'G^)-\GA-')]^^ , (3.5) 
where the subscript denotes the {ij} component of a matrix. 

Lemma 3.3 Wick's theorem holds for constrained expectations, namely, 
I p 

n K - {^ip)v) 

P odd 



^^perm {^n ; 

}y-Cov{ 

}y P even 



(3.6) 



where the summation is over all possible pairings of the P coordinates that are in the 
list. 



Equation ( p. 3] ) shows that conditioning data alter expectation values linearly in the 
Va and independently of multiplicative factors in the covariances. Equation ( |3.5| ) shows 
that conditioned covariances are determined by the matrix G alone, without reference 
to the Va- Equation (|3.6| ) shows that the conditioned Gaussian distribution, while not 
satisfying the requirement that the covariance matrix be non-singular, retains a key 
property of Gaussian densities. 

In the applications below we shall use Gaussian variables parameterized by a con- 
tinuous variable x, i.e., Gaussian random functions u = u{x). Their means, {u{x)), and 
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covariances, a~^{x,y) = {u{x)u{y)) — {u{x)) {u{y)) , will be defined for all {x,y) in an 
appropriate range rather than only for integer values of The matrix A^^ becomes 

the integral operator whose kernel is a function a~^. The kernel a = a{x,y) of the 
operator A inverse to A~^ is defined by 



a {x,y)a{y, z) dy = 6{x, z). 



The vectors with entries gai, become functions ga{x), and the conditions ( |3.2| ) become 
equations (|1.2|) . The regression formula, ( ^Tsf) , then changes into 



9p{y)u{y) dy 



where 



= \ j a ^{x,y)g^{y)dy\m^l 



and the are the entries of the matrix M ^ whose inverse M has entries 




ga{x)a ^{x,y)gp{y)dxdy. 



The formula for the constrained covariance can be obtained from ( |3.5|) by replacing each 
i by an x, each j by a y, and each summation over a Latin index by the corresponding 
integration. Wick's theorem is still valid with the appropriate changes in notation; note 
that the Greek indices, which refer to the N initial data, remain integers. 



4 An overview of invariant measures and Hamilto- 
nian formalisms 

Before proceeding with our numerical program, we summarize some material on Hamil- 
tonian systems, invariant measures in finite and infinite dimensional systems, and the 
properties of certain measures. This material can be found in books on quantum field 



theory and related topics (see, e.g., ||T^, |T^, |23[). More specific references will be given 



below; we take a very elementary point of view. 

A Hamiltonian system is described in terms of n "position" variables, qi, and their 
associated "momenta", pi, i = 1, . . . ,n; a. Hamiltonian function H = 'H{qi,Pi) = 'H{q,p) 
is given, and the equations of motion are 

-^=n,., ^ = -^., ^ = h...,n. (4.1) 



If the initial values of the 2n variables q, p are given, it is assumed that the system ( [4.1[ ) 
has a global solution in time. Suppose the initial data are chosen at random in that 
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2n dimensional space, with a probability density f{q,p, 0); it is easy to check that the 
probability density of the g's and p's at time t satisfies the Liouville equation (see ||T0[|): 



/.+E 



dqj dpi 
dt ^ dt 



(4.2) 



An invariant density is a probability density that does not depend on time, i.e., one 
that satisfies equation ([4.2|) with ft = 0. Recall that by the definition of a probability 
density, / > 0, and j fdpdq = 1 (with obvious notations). One can readily see that 
any function of Ti. with these two properties is an invariant density; the one that is 
natural for physical reasons is / = exp(— 7i/T), where Z is a normalization constant 
and T is the "temperature". We set T = 1 unless specified otherwise. If the initial 
data are sampled from this initial distribution, and each of these samples is used as 
an initial datum for the equations of motion, then the probability distribution of the 
variables q and p at any later time t will be the same as it was initially. We now 
wish to generalize these notions of Hamiltonian systems and invariant distributions to 
the infinite-dimensional case, where the equations of motion will be partial differential 
equations and the invariant distributions will be called "invariant measures" . We do so 
by way of an example that will be used in later sections. 

Take the interval [0, 27r] and divide it into n segments of length h. At each mesh 
point jh, j = 1, . . . ,n define variables qi,Pi ; introduce the "Hamiltonian" 



i=l 



{P 



'i+l 



Pi 



2/^2 



+ 



i+l 



2/l2 



F{Pi,qi) 



(4.3) 



where values of q, p outside the interval are provided by an assumption of periodicity, and 
the term in brackets is a Hamiltonian density. Consider the set of ordinary differential 
equations: 



dpi 
dt 



^hq 

where A/^g = (gj+i — 2qi + qi^i)/h^ , and similarly, 



(4.4) 



(4.5) 



Note that the right hand side contains a factor that has no analog in the finite 
dimensional system above; its effect is to differentiate the Hamiltonian density rather 
than the Hamiltonian itself. This modification is needed to get self-consistent limits 
as /i — >■ 0, a limit operation we shall now undertake (see |]TU|, 0]). As — > 0, these 
equations formally converge to: 



Pt = qx 



11 



(4.6) 



qt 



'Pxx ~l~ 



pi 
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or, writing u = q + ip with imaginary i and F'{u) = Fq + iFp, we find an equation of 
Schrodinger type: 



lUt = -u^x + F'{u). 
The Hamiltonian, ([4.3|), converges formally to 



2-IT 



-\u,\^ + F{u) 



dx. 



(4.7) 



where the vertical lines denote a modulus. One has to examine what in these passages 



to a limit is justifiable, see e.g. p3| for a physics analysis, and for a mathematical 
analysis. 

For every finite h, the (finite dimensional) measure 



Z ^ exp 



2h^ 



+ F 



(4.9) 



is invariant for the system ( [4.4| - |4.5| ). Note that this is true both with and without the 
extra factor in equations ([4.4| - [4.5|) . What is the limit of this measure as /i — 0? 
Set for a moment F = 0. The exponential in ( [4. 3D factors into a product of terms each 
one of which is the exponential of a single difference quotient, of the form exp[— (pj+i — 
Pi)'^/2h] or with q replacing p. Hence, the p variables are independent of the q variables. 
Furthermore, the "increments" pj+i —pi, qi+i — qi, are obviously Gaussianly distributed, 
have a variance proportional to the distance h, and are all independent of each other. 
Thus in the limit, the functions q{x), (and similarly for p{x)), are made up of independent 
Gaussian increments. They differ from Brownian motion (see P|) by being periodic 
rather than satisfying g(0) = (they are "Brownian bridges" - which does not make 
a deep difference). Also, as long as F = 0, the common value of g(0) and g(27r) are 
undetermined because the exponent of the exponential is unchanged when one adds a 
constant to the q^; one can remove this degeneracy by adding a term to the exponent 
that is sensitive to the value of q{0). Thus, the limit of 



exp 



(» 



2h^ 



+ 



{Pi+i - Pi 
2h^ 



+ Fiq,p) 



dqi ■ ■ ■ dqndpi ■ ■ ■ dp„ 



can be written as 



dBc ■ exp( 



Fdx), 



(4.10) 



where Be is a suitably conditioned Brownian (Wiener) measure. 

As is well-known (see, e.g., 0), a sample Brownian path is, with probability one, 
nowhere differentiable. This fact is not changed by the factor exp(— J Fdx) in ( 4.101 ); 
thus if we sample initial data from ( [4.10| ) we obtain weak solutions of the equation of 



motion ( [4.7|) , as was indeed assumed in section |[ In addition, the integral J \ux\'^dx 



diverges, so that the limit in ( [4.3|) is purely formal; its meaning is given by equation 
( [4.10| ). An important consequence of these facts is that the exponential in ( [4.9| ) tends 
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to zero as /i ^ 0; this is indeed necessary if we are to have a reasonable function-space 
measure: As /i 0, we have a measure on a space of increasing dimension, the density 
of functions that satisfy the set of inequahties Sj < < Sj + dsi,i = 1, . . . ,n, should 
decrease as n increases, thus exp(— 7i) should tend to zero and H should diverge. 

Weak solutions that are spatially like Brownian motion are difficult to resolve; dif- 
ference quotients do no converge, and the Fourier series expansions of these solutions 
converge slowly. We are thus consistent: our machinery applies in problems where it 
is indeed needed. Such problems are not exceptional; for example, in the vanishing- 
viscosity limit, the solutions of the Euler equations have a Holder exponent of 1/3, i.e., 
they are even less smooth than Brownian motion (see ||20|| ). 

Finally, a computational comment that will be useful below: Brownian motions and 
Brownian bridges are easy to sample via interpolation formulas related to the regression 
formulas of the previous Section ^ (see e.g. 0,[jl3); to modify the measures so as to 
take into account the factor exp(— / Fdx) requires simply that the samples be rejected 
or accepted with some form of a Metropolis algorithm. 



5 Conditional expectations with a non-Gaussian prior 

Our goal in this section is to introduce a systematic approach for solving the equations 
of optimal prediction ( ^.Sj ) for nonlinear equations of the form ( |1 . 1| ) . We assume that 
equation ( |1 . 1| ) has Hamiltonian form, i.e., we assume that there exists a Hamiltonian 
Ti. such that ( |1 . 1|) is the Hamilton equation of motion with this Hamiltonian. We can 
therefore assume the existence of a prior measure, iJ,{u) whose density / has the form 

/(n) = Z-V«("), (5.1) 

with Z a normalization constant. 

In order to solve equation (|2.5| ), we must compute the conditional expectations on 
its RHS, 

(g, {Riu))y) , (5.2) 

with R being the RHS of equation ( |1 . 1| ) . This task is relatively simple when the prior 
measure fi is Gaussian; we address here the problem of what to do when it is not. The 



method we present is based on perturbation theory, see e.g. |T3|, |T3|. The idea is 
to reduce the computation of ( ^.2|) to the computation of regressions with respect to a 
Gaussian measure via a perturbation expansion. 

We start by splitting the Hamiltonian into two parts, 

n = n° + n\ (5.3) 

Here Ti" is quadratic (producing a Gaussian measure /i°) and is a non-quadratic 
perturbation. A conditional expectation of a functional J-'{u) is defined as 

{J^)y = [ J^{u)dfiv, (5.4) 
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with the normahzation constant Z = J dfiy. Averages with respect to Gaussian mea- 
sures will be singled out by a superscript 0: 



{J'fy = Z,' j J^{u)dfi'y, (5.5) 
with Zq = J dfiy. Based on the Hamiltonian split, ( |5.3|) , we can carry out the expansion 



oo 

-n _ _ 



e'^ = e-'j:'-^{nr, (5.6) 



k=0 

which can be then utilized to write 

\k 



k=0 ' 



Hence, combining (|5.4|) , (|575|) and ( |5.7|) we find 



^ k\ z ' 

fc=0 



The ratio Zq/Z is: 



fc=o ■"■ fc=0 

and therefore: 



" E£oti>i{(H')')° ■ 

In particular, for a = 1, . . . ,N, 



(^„, (/?(«)),,) = ^. (5.^ 

The RHS of (|5.8|) is now written in terms of expectation values that we already know 
how to compute since they are averages with respect to a Gaussian measure. Note that 
the division by Z/Zq can be avoided by removing certain terms from the numerator 
of ( ^.8|) ; indeed, (|5.8|) is a conditioned expansion in Feynman diagrams and it can be 
normalized by removing unconnected graphs (see [|, |T^, |T3|). We choose not to explain 
this fact here. Note also that the leading term in the expansion, (corresponding to 
= 0), where the measure ny is simply replaced by /i^, already contains a contribution 
of the nonlinear terms in the equation of motion. 

Computing a finite number of terms in the series expansion (|5.8| ) can still be a 
relatively complicated task, and we demonstrate in §|] a step-by-step solution of a model 
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problem. The reader should not be unduly worried by the complexity of some of the 
expressions, because : (i) To make the exposition as clear as possible, no advantage 
is being taken here of various ways of simplifying the expressions, such as, e.g., using 
orthogonal functions for the (ii) much of the algebra can be automated (see below). 

The partition of Ti into Gaussian and non-Gaussian parts is not unique, and in 
section |^ and ^ below we shall discuss some ways to optimize it in order to gain accuracy. 



6 A perturbative treatment of a nonlinear Schrodinger 
equation 

We now utilize the perturbation method to approximating solutions of the non-linear 
Schrodinger equation of the form ([4.7D, 

lUt = —Uxx + ^ ["^ I'^^l^ + '^*^] ' u = q + ip, (6.1) 



in the interval [0,27r], with periodic boundary conditions. Equation (|6.1| ) can also be 
written as the pair of equations, 

Pt = Qxx - g^ 

(6.2) 

qt = -Pxx+P^- 
The corresponding Hamiltonian is 

nip,q) = \j [pl{x) + ql{x) + ^ [p\x) + q\x)\^ dx, (6.3) 

and hence, equations ( |6.2| ) preserve the canonical density 

/o(p,g) = Z-V«(^'«). 

(Note that the temperature T has been set equal to 1). To simplify the example, we 
follow and use the same kernels in the definition of the collective variables for p and 
g, i.e., the collective variables are 

{UlUl} = {{ga.p).{ga.q)}. a = l,... ,iV, (6.4) 

and their initial values, V^, V^, are given. The system of ordinary differential equations 
arising out of equation (|2.5|) for the V^, is: 



+ 1^ - (^?"' )v) 



dt ' V ^^'V ' ' '''' 

a = l,...,N. (6.5) 

dV^ ( 



a_ 

dt V^"' dx'^ 



- 1^ + (^?°' {P')v) ■ 
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We therefore have to compute the four terms c^xx {Qa.dxx {p)v)^ i9a,{Q^)v) 

and {ga, 

The first step is to spht the Hamiltonian into two parts - a quadratic and a non- 
quadratic part, TC = TiP + This will be done in the next section. We will just note at 
that point, that VP will be of the form = | / [pli^x) + ql{x) + ml {p^{x) + q^{x))] dx. 
Once this is done. The RHS of ( |63|) can be computed following ( |5.8| ). For example, 
the term {g^, {p^)y) can be obtained by substituting for R{u) in that equation. In 
particular, the zeroth, leading term is: 

{P%) , (6.6) 

(note that this term already includes a nonlinear effect); a first-order (in the perturbation 
series) approximation will add the term 



9a, {p'n') 



V 



(1 - mv) 



(6.7) 



and so on, with similar expressions for the rest of the terms on the RHS of ( p.5|) . Note 
that all these expansions are used to solve the equations of first-order prediction; in 
principle, we can perform higher-order predictions by using higher moments in setting 
up the matching of conditions before equation (|2.5| ), and then improve the evaluation of 
the right-hand side of equation ( |2.5| ) be using more terms in the perturbation expansion. 
We consider here only the latter possibility. 

The leading-order terms ( p^ ) can be computed using the results of For the 
first-order term, (16.71), we have 



V 



p'n% = \(p' j ¥ + '^mlip' + l')] dz 



and we therefore have to compute 





(p' j fdz^^ = jy{x)f{z))ldz, (p' j q'dz^^ = jy{x)q\z))ldz 

for s = 2,4. To summarize, the RHS of ( |6.5| ), involves integration (with respect to z) 
and differentiation (with respect to x) of the following (with j = 1, 3 and s = 2, 4) 

{pi{x)f{z))l, {p^{x)q%z))l, {q^{x)p%z))l, {q^ {x)q%z))l . (6.8) 



The terms in ( |6.8| ) can be computed by application of Wick's theorem (Lemma |3.3| ); 
the algebra can be performed with aid of a symbolic computer program (such as Math- 
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ematica). In particular, one obtains the identities: 

{p{x) p\z)) = -2 {p{x)) {p{z)f + 2 {p{z)) {p{x)p{z)) + {p{x)) {p\z)) . 
{p{x)p'^{z)) = 6 {p{x)) {p{z)f - 8 {p{z)f {p{x)p{z)) - 



-12 {p{x)) {p{z)Y {p\z)) + 12 {p{z)) {p{x)p{z)) {p\z)) + 3 {p{x)) {p\z)) . 
{p\x)p\z)) = -12 {p{x)f {p{z)) {p{x)p{z)) + 6 {p\x)) {p{z)) {p{x)p{z)) + 

+ {p{x))'U ^p[z)f-2{p\z))\ + 



{p\x)p\z)) = 72 {p{x)f {p{z)) {p{x)p{z)) {p{z)f - {p\z)) + 



+ {p{x))' ( - 20 {p{z))' + 36 (piz))' {p\z)) - 6 (p'iz))' ) + 



6 {p{x)p{z)f + (/(x)> - 6 {p{z)f + 3 {p\z)) 



+ 12 {p{z)) {p{x)p{z)) 



2 + {p\x)) - 2 + 3 {p\z)) 

A{p{x)p{z)f ( - 2{p{z)f + {p\z))\ + 



+ 



+9 



+ (p2(x)> 2 {p{z))' - 4 + (fi^z)) 



(6.9) 



All the averages in equation ( |6.9| ) are of course conditioned by the constant repetition 
of the subscript V and the superscript has been avoided for esthetic reasons. 

We now pick the kernels ga{x) to be translates of a fixed function g{x), i.e., ga{x) = 
g{x — Xa), which is a normalized (not random!) Gaussian with periodic images and 
width a: 



9{x) = -j^ Y] exp 



(x - 27rr)^ 



We note that the Fourier representation of g{x) is g{x) 
this choice of kernel functions we can write 



2tt Z^k=-oo ^ 



N 



(6.10) 
Given 

(6.11) 



a=l 



where 



1 ^ °° p-3'='<^' 

C(x) = cl^ix) = -J2Il yr—-2 



eyiY>[ik{x - xp)][m ^]^^, 



and 



pp 



1 e ■2'' " 



- y 



27r ^ P + m^ 



exp[iA;(xa — X/3)]. 
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This choice of kernels is the same as in previous work Q • It is far from optimal in 
the context of perturbation theory; in particular, orthogonal kernels such as the Fourier 
kernels used in section 7 below reduce the number of non-zero terms in the expansion. 
We thought that we should present at least once a perturbative calculation with a 
general kernel. 



6.1 The partition of the Hamiltonian 

We now turn to the question of how exactly the Hamiltonian should be divided into 
a sum of a quadratic part and a perturbation, Ti = HP + Of course we wish H} 
to be as small as possible, so as to have a perturbation series that behaves as well as 
possible; we therefore perform a partition with a few free parameters over which we 
shall minimize 7i^; we choose to write: 

H° = ^ / {\u.,\^ + ml\u\^ + h)dx, (6.12) 







V} = - ! {F -ml\u\^ -h)dx, 
2 Jo 



where, as before, u = q + ip, F = ^{p'^ + q'^)- There are no odd powers in the partition 
because the measure is invariant under the reflection u ^ {—u); note that the term 
ttIq \uf removes the indeterminacy in the Gaussian measure defined by Ti^. This is 
not the only partition that can be considered; one could for example add and subtract 
squares of fractional derivatives of u. The task at hand is to choose good values for the 
parameters tuq and b. They cannot in general be chosen so as to make the perturbation 
series convergent; what one would really want is to make the measure defined by Ti 
and conditioned by the Vq, be a small perturbation of the conditional measure defined 
by 7i°, and while this can in principle be done, and would presumably lead to time- 
dependent equations for mo and b, it is reasonable, as a first try, to choose mo and b 
so as to minimize (Ti.^) , ((7i^)^), where the averages are unconditional. Note also that 
the presence of the term b leaves all averages with respect to the measure defined by 
unchanged — it gets absorbed into the normalization constant Z — but it does affect the 
expansion of exp(— 

A straightforward algebraic manipulation, simplified by the fact that p and q are 
uncorrelated and that {p'^) = {q'^) for all s, leads to 



:(wv>°=i/ / d.dz 



2tt p2TT 
z=0 Jx=0 



2 {p\x)p\z))' + 2 [{p\x)) 
-8ml {p\x)p\z)f - Sml {p\x)f {p\z)f + 
+8m^ (<p'(a;)>°)' + 8m^ {p\x)p\z))' 



ON 2 



(6.13) 



where the (■)°, denotes average with respect to the unconditional Gaussian measure. 
Using the expressions ( |6.9| ), one can numerically minimize (|6.13|) as a function of mo. 
This minimum is obtained at mo ~ 1.055. 
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After setting mg, we are free to pick the second constant, b, in ( |6.12| ). Changing b 



will not affect the variance which we just computed in ( 6.13 ). We choose to set the first 
moment of the perturbation to zero, i.e.. 



Given the partition, ( |6.12D , it is clear that ( p.l4| ) holds if 



(6.14) 



(6.15) 



With the aid of Wick's theorem and equation ( |6.11|) , equation (|6.15| ) can be rewritten 
as 



bn 



3 

Ait 



oo 



1 



k"^ + itlq 



rrin 



fc=— oo 



k'^ + rrLn 



(6.16) 



If one chooses mo = 1.055 so as to minimize the second moment of (7i^), the RHS of 
( |6.16| ) equals —1.197, which in turn determines b = —0.381 so that the first moment of 
the perturbation will vanish. Note that even the zero-th order expansion uses informa- 
tion about the higher-order terms, since the parameters that determine the partition 
and thus the zero-th order term depend on an analysis of the later terms. 

One should also note that once the future conditions have been determined, one has 
to perform further regressions to obtain the mean solutions at various points in space; 
the machinery there is exactly analogous to what has just been done, and will not be 
spelled out here. 



6.2 Numerical checks 

We now check the perturbation series as well as the optimal prediction scheme by com- 
paring the results they give with numerical results obtained at substantial expense by 
sampling the initial conditions, solving the differential equations over and over, and 
averaging. We concentrate in the present section on the case N = 2, i.e., a case where 
we have initially as data only four values of collective variables, and are trying to find 
the mean future conditioned by these four values. We display only the variation in time 
of these collective variables, U^, U^, a = 1,2, which were defined in equation (|6.4|); The 



kernels are taken as ga{x) = g{x — Xa) with xi = tt/S and X2 = 97r/8. The parameter 
a is set as vr. We first use the perturbation series with the optimal values mo = 1.055 
and b = —0.381 computed in the previous section |6ri| . We then display results obtained 
with different values of mo and b. Specifically, we choose mo = 0.9 and 6 = 0, obtained 
by splitting the Hamiltonian into 



H° = i / {\u,\^ + ml\uf)dx 
2 Jo 

= - [ {F -ml\u\^)dx, 
2 Jo 
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and requiring only that the first moment of the perturbation vanish, i.e., {H})^ = 
(compare with ( |6.12| ) and ( 6.14 )). 

We check these results by replacing the continuum equations (|6.2D, (|6.5D, by a formal 
finite difference approximation conditioned by discrete forms of these equations, and 
then display the convergence of the conditioned mean of many solutions of the difference 
equations to the optimal prediction obtained by the perturbation analysis. Specifically, 
we replace equation (|6.2|) by the following difference equations. 



f dpjj) _ q{j - 1) - 2q{j) + q{j + 1) 
dt 

dq{j) 



dt 



(6.17) 



where h = 27r/n is the mesh size. The conditions (3]^) are replaced by the following 
discrete approximations: 



a 



1,2, 



(6.18) 



(A factor h has been introduced in the definition of the collective variables to allow 
them to converge to the continuum collective variables {ga,u) = J ga{x)u{x)dx.) 
The Hamiltonian (|6.3D is replaced by the discrete Hamiltonian: 



Hip.q) 



(6.19) 




p(j + 1) - pU) 

h 



+ 



g(j + 1) - 
h 



+ ^[p\j) + q\j)] 



We present results obtained with two mesh sizes, corresponding to n = 8 and n = 16. 
We also checked that the results with n = 32 are very close to the results with n = 16. 
For each mesh size h we use a Metropolis Monte-Carlo algorithm to find 5000 initial data 
drawn from the distribution defined by ( |6.19| ) conditioned by the values of the collective 
variables, integrate the equations in time up to t = 1, and average the results at various 
fixed time intervals. This numerical calculation is very costly, even for moderate values 
of n, but it is independent both of the perturbative analysis and of the machinery 
of optimal prediction. It is important to note that as the mesh size h tends to zero, 
the results of a standard (i.e., non-averaged) finite-difference solution of equation ( |6.2| ) 
with data drawn from the distribution ( |6.19|) diverge pointwise as h ^ 0. Conditional 
averaging provides the only meaningful numerical solution of equations ( |6.17| ) for such 
initial data. 

In Figure B?T| we present the evolution in time of the four collective variables f/f , 
[/|, Uf and U2 with the optimal mo = 1.055, b = —0.38. Figure presents the plots 
corresponding to the choice mo = 0.9 and 6 = 0. The zeroth-order solution is the optimal 
prediction solution obtained with only the zeroth, leading, term in the perturbation 
expansion (see, e.g., equation ( |6.6|) ). The first-order solution is the optimal prediction 
solution obtained after adding a first-order correction to the perturbation series. Note 
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however that the optimal choice of parameters uses information about terms of order 
one. 

In both figures, the sohd hues are the solutions obtained with the optimal prediction 
equations, while the dotted lines represent an average over over 5000 solutions that 
evolve from initial data sampled from the conditioned discrete Hamiltonian 

Clearly, there is an improvement when one uses additional terms in the perturbation 
series. Also, even though the individual numerical solutions converge only weakly to 
a continuum limit, the average over numerical solutions with data sampled from the 
discrete Hamiltonian on one hand and the solution of the optimal prediction equations on 
the other hand get close as n increases. The perturbation expansion converges rapidly; 



the key graph is the one on the bottom right of Figure |6.1| : the comparison between 



the expansion up to first-order with the average numerical solution with n = 16. One 
conclusion we draw from these graphs and use in section 7 below is that with the optimal 
partition of the Hamiltonian one can obtain an accurate solution with only the zero-th 
term in the expansion; the computation of the optimal parameters in the expansion uses 
the first-order term. 

The limitations of first-order optimal prediction are displayed in Figure |6.3| , where 
the integrations are carried out to longer times. We are working with a temperature 
T = 1, i.e., the fluctuation in u are of order 1; by contrast, in the longer runs of 0,0 
we used a smaller temperature T = 7r/15; we also have only 4 collective variables, not 
enough at this temperature to keep the variances of the collective variables small. In 
Figure [67^ the optimal prediction solution is based on the optimal choice of mo = 1.055, 
h = —0.38, and the discrete solution is presented for n = 16. Once again, the numerical 
solution is an average over 5000 individual solutions. 

The effect of temperature is displayed in Figure |6.4| , where we present the time 
evolution of the four collective variables, U^, up to time t = 5. Both graphs are for 
the same initial values of and but differ in the temperature which determines 
the distribution of initial data. The graph on the left corresponds to low temperature, 
T = 0.2, whereas the graph on the right corresponds to high temperature, T = 4. As 
expected, the higher the temperature, the faster is the decay towards equilibrium; the 
collective variables tend faster towards their equilibrium value of zero, and the first- 
order prediction scheme with a small, fixed number of conditions loses accuracy faster. 
There are two ways to improve the prediction: Go to more sophisticated prediction 
theory, as outlined in section 2, or increase the number of collective variables. The 
first alternative will be explored in later publications; the value of the second approach 
will be shown in the next section, with a choice of kernels that reduces the amount of 
labor and also makes possible an analytical estimate of the difference between optimal 
prediction and a simple scheme that makes no use of the prior measure. Note that the 
optimal prediction runs yield good results when the standard deviation of the values of 
the collective variables is as large as 50 percents of their mean (the standard deviation 
of the pointwise values of the solutions is much larger still). 
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Figure 6.1: Time evolution of four collective variables f/f, f/f, Ul and [/| for the nonlinear 
Schrodinger equation ( |6.2|) ,( |6?17D , with the optimal mo = 1.055, b = —0.38; Solid lines 
- optimal prediction equations. Dotted lines - average over 5000 solutions obtained from 
initial data sampled from the discrete Hamiltonian ( |6.19D with n = 8 and n = 16 points. 
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optimal prediction witin zerotln-order expansion, n=8 
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optimal prediction with zeroth-order expansion, n=16 
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Figure 6.2: Time evolution of four collective variables f/f, f/|, Ul and f/| for the nonlinear 
Schrodinger equation ( |6^) ,( |07|) , with mo = 0.9, 6 = 0; Solid lines - optimal prediction 
equations. Dotted lines - average over 5000 solutions obtained from initial data sampled 
from the discrete Hamiltonian ( |6.19| ) with n = 8 and n = 16 points. 



Optimal prediction for Hamiltonian partial differential equations 24 



optimal prediction with zeroth-order expansion, n=16 
1 




optimal prediction with first-order expansion, n=16 
1 




-0.5 



Figure 6.3: Longer time evolution of four collective variables f/f , U^, Uf and t/| for the 
nonlinear Schrodinger equation (|6.2|),( |OTD , with the optimal tuq = 1.055, b = —0.38; Solid 
lines - optimal prediction equations. Dotted lines - average over 5000 solutions obtained 
from initial data sampled from the discrete Hamiltonian ( |6.19| ) with n = 16 points. 
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Figure 6.4: Long time evolution of the collective variables for initial distributions of the form 
e-^/^ with (a) T = 0.2 and (b) T = 4. 
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7 Pseudo-spectral optimal prediction for a model 
nonlinear problem 

In the present section we consider a discrete version of the same Schrodinger equation as 
above; our goal is to show exphcitly how the information in the prior measure improves 
the accuracy of an underresolved nonhnear calculation. One of the striking facts shown 
by the example is that first-order optimal prediction is useful in nonlinear problems. 
In view of the rapid convergence of the conditional expectations of discrete problems 
to their continuum limits, as displayed in the previous section, we shall be content 
with the discrete problem. Specifically, we shall contrast the solution of a discrete 
problem by a pseudo-spectral method that takes no cognizance of the prior measure 
with a closely related optimal prediction scheme with Fourier kernels, and show how 
the information in the prior measure improves the predictions. In the present section, a 
complex-function formalism turns out to be more transparent, and we therefore slightly 
change the notations. We consider a set of complex ordinary differential equations which 
is a formal discretization of our Schrodinger equation, (|6.1|), 



dt 4 



[3\u/uj + iu*f], (7.1) 




with j = 1, . . . ,n and h = 211/71. Equations ( [7. 1|) are the Hamilton equations of motion 
derived from the following Hamiltonian, 

+ ^K + 6Kr + («*)^]| (7.2) 

(see section 4 above). One can readily verify that this is the same discrete Hamiltonian 
as before. The prior measure is the canonical measure whose density is 

fiu) = ie-^("). (7.3) 

Note that we write u without boldface, as we did in the case of functions, but not as we 
did for vectors; this is done by analogy with the previous sections on the Schrodinger 
equation. (Those who look at our earlier papers 0,0,0 will notice that the measure 
here differs from the measure used there by a factor h in the exponent, and also that 
we used a different spatial period.) Equation ( [7.3| ) requires an explanation when the 
variables uj are complex. A set of n complex variables can be treated as a set of 2n 
real variables. However, when the real and imaginary parts qi,Pi of each variable Ui are 
independent and their means are zero, (as they are here), one can write: 



P{si <qi<si + dsi, . . . , ri < pi < ri + dn, . . . , r„ < p„ < r„ dr^) = 

= F{z) dzi . . . dzn = Z^^ exp ^— -(z, Az*)^ dzi . . . dz^, (7.4) 

where Zi = Si + tVi, dzi = dsidri, and the matrix A is hermitian. When ?i is a quadratic 
function of the vector u, equation ( [7.3| ) defines A in the complex case. 
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The prior measure ( |7.3| ) being non-Gaussian, we proceed as above and partition 
the Hamiltonian into a quadratic part plus a perturbation, Ti = + 7i^. To make 
the example amenable to analysis we keep only the leading term in this expansion; 
we already pointed out that the leading term contains a contribution of the nonlinear 
terms in the equation and that the partition takes into account higher order terms in the 
expansion. As explained above, there is here only one relevant partitioning parameter, 
niQ. Thus we approximate the probability density by f{u) = Z^^e^^ where 

n\u) = \ ± |fe^±i_^ + hml . (7.5) 

We work here with n = 32. In this finite dimensional case, we rederived the optimal 
value of mo as follows: We calculated the two-point correlation function {uiU*) for 
the measure ( |7.3| ) by a Monte-Carlo method, and then found the value of mo that best 
reproduces this correlation function by minimizing the mean-square difference between 
this correlation function and the correlation function produced by ( |7.5|) . This yielded 
mo = 1.055, confirming the value obtained above. The procedure used here minimized 
the difference between the full measure and its quadratic piece while the more general 
procedure of the previous sections minimizes only the first few terms in an expansion; 
it is comforting that the results agree. For readers of our earlier papers 0, |^, 0, we 
point out that the present procedure differs from the "Gaussianization" proposed there 
by using an analytical expression for the approximate measure, whereas in the previous 
publications we needed to store the full covariance matrix. Furthermore, the present 
construction produces a first term in a systematic expansion. 

In Figure [7?1| we compare the exact two-point correlation function (uiU*) obtained 
by a Monte Carlo sampling (open circles) to the Gaussian approximation (solid line), 

\0 1 Qtk{xi-Xj) 

Qj = {u,u*) =-2^ 4 ,i^2M , . (7-6) 

fc=l h2 2 ^ "^0 

derived from (|7.5| ). The discrepancy is negligible compared with the statistical uncer- 
tainty in the sampling procedure. 

The formulas for calculating conditional expectations (|3.3| )-( |3l5| ) can be generalized 
to deal with complex functions. Assume we have conditions of the form. 



where the Qai are the complex entries of the N x n kernel matrix G and the entries of 
the vector are the values of the A^ collective variables defined by G. The formulas 
for the conditional means and variances of the complex vector Ui generalize equations 
( ^.31 ) and (|3.5|); the conditional average of Ui is 



the are the entries of the matrix Q = {CG'^){GCG'^)^^, where C, whose elements are 
defined in ( |7.6| ), is the matrix that approximates the inverse of the matrix A defined in 
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Figure 7.1: Comparison between the two-point correlation function (uiU*), as computed 
by a Monte Carlo sampling procedure, and its approximation ( [7^ ) based on the Gaussian 
measure ([7.5|), with ttiq = 1.05. The number of points \s n = 32. 



( [7.4| ). The dagger denotes an adjoint (hermitian transposed) matrix. The conditional 
covariance matrix has entries 

Cov{m„m*}°^ = [C-{CG^){GCG^)-\GC)\^^. 

We now make a special choice of kernels G: We pick them so that what is known 
at time t = is a set of Fourier modes, fewer than are necessary to specify the solution 
completely. This makes the Qai complex exponentials: 

gai = - exp {-iKaXi) . 
n 

If the number of conditions is even, the K^, take the values — y + 1, . . . , Note the 
following property of the resulting matrix G: 

GG^ = -I, (7.7) 

n 

where I is the N x N identity. Spectral variables are particularly convenient here 
because, if u is expanded in Fourier series, and this series is substituted into the formula 
for the Hamiltonian 7i, the result is a sum of squares of Fourier coefficients; the prior 
measure is "diagonal in Fourier space." This can also be deduced from the fact that the 
measure is Gaussian and translation-invariant in the variable x. In particular, after u 
is expanded in Fourier series, the matrices A, G are both diagonal. A short calculation 
yields: 



Q = nG\ 
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from which follows 

{u)l = nG^V, (7.8) 

and 

Coy{u,,u*}1= [C-nG^GC]^^. (7.9) 

Equation (|7.8| ) is the interpolation formula for the first moment of u. Recall that V = 
Gu, hence {u)y = nG'^Gu. The operator nG^G is a Galerkin projection operator 
which projects any vector onto the vector space spanned by the range of G. Thus, the 
mean of u produced by our regression formula equals the mean obtained from a simple 
Fourier series that uses the known coefficients Va (we shall call this Fourier series the 
"Galerkin reconstruction"). This is as it should be: In a translation-invariant Gaussian 
measure the Fourier components are mutually independent; the knowledge of the first 
few does not condition the next ones, whose expected value is therefore zero. However, 
the measure does contain information about the higher moments of the higher Fourier 
coefficients, and this is important in a nonlinear problem. 

The difference between the Galerkin reconstruction and the regression used in the 



optimal prediction of moments is demonstrated in Figure [7.2| . The three graphs depict 
the first three moments, calculated (i) by a Monte Carlo sampling of the measure ( [7.3[ ) 
conditioned by = 4 known collective variables (symbols), a Galerkin reconstruction 
(dashed lines), and our regression formulas (solid lines). We chose a system size of 
n = 32 with = 4 resolved (i.e., known) Fourier modes. For the first moment, (tti)y 
(calculated by Monte-Carlo), the Galerkin reconstruction and the regression are close 
to each other. For the second and third moments the regression, which is the core of 
optimal prediction, is close to the truth, as revealed by the Monte-Carlo runs; the error 
is smaller than the statistical uncertainty in the sampling. The Galerkin reconstruction, 
on the other hand, deviates significantly from the truth. These graphs demonstrate the 
importance of the prior measure, which contains information about the mean squares 
of the unresolved Fourier components. 

We next derive the optimal prediction scheme for our model problem with Fourier 
kernels. Given a set of Fourier modes, Va, we replace the right-hand side of ( |7.1| ) by its 
conditional average, and multiply the result by the kernel matrix G . Thus, 



«^ = 9ai ( p + 4 ^ 1^*1 + K) J 







V 



Substituting the regression formulas ( fTsP and ( |7.9| ) and using Wick's theorem and ( |7.7| ), 
we obtain the following set of N equations: 

= ^ + i 5^ ^^^^^^^^ <f^-''^^ + + l^Vc, 

(7.10) 

where 

c = [(/ — nG'^G)C~\.. (no summation on i) 
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Figure 7.2: Conditional moments u: comparison between true values (symbols), a Galerkin 
reconstruction (dashed lines), and regression (solid lines), (a) The conditional expectation 
{ui)y\ the circles represent the real part and the crosses represent the imaginary part, (b) 
The variance {uf)y. (c) The third moment Re {\uif Ui)^. The number of points is n = 32, 
and the number of Fourier modes that are assumed known is = 4. 
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Figure 7.3: The parameter c as a function of the number of resolved Fourier modes, for 

n = 32. 

is a constant (the right-hand side is independent of i). Note that the last term comes 
from the evaluation of the nonlinear terms by Wick's theorem. The structure of equation 
( [7.101 ) is enlightening: The first two terms on the right-hand side are precisely the 
Galerkin approximation for the evolution of a subset of Fourier modes; they constitute 
a pseudo-spectral approximation of the equations of motion. The third term, which is 
linear in V, represents information gleaned from the prior measure. The nice feature of 
this example is the sharp separation between the contribution from the resolved degrees 
of freedom and the contribution of the "subgrid" degrees of freedom, which happens to 
simply "renormalize" the linear part of the evolution operator. 

One is of course interested in knowing how large is the extra term that makes up the 
entire difference between the optimal prediction scheme and a standard pseudo-spectral 
scheme; this difference is proportional to the coefficient c. In Figure [7.3| we plot the 
value of c as function of for n = 32. As expected, c is larger when the number of 
resolved degrees of freedom is smaller , and vanishes when N = n, i.e., when the system 
is fully resolved. The oscillations in this graph result from the alternation between odd 
and even numbers of Fourier modes. Figure |7.3| demonstrates that optimal prediction is 
consistent: As the number of collective variables increases, its predictions converge to 
those of a resolved calculations (as indeed should be obvious from the derivation); when 
the number of collective variables is small, the corrections due to optimal prediction are 
substantial. 

In Figure [7^^ we compare the time evolution of the first 4 Fourier coefficients pre- 
dicted by our optimal prediction scheme ( [7.10|) (solid lines), the time evolution of these 
modes predicted by a Galerkin scheme (dashed lines), and their exact mean evolution 
obtained by sampling 10^ states from the conditional measure, evolving them in time, 
and averaging the first Fourier components (circles). We do this with A^ = 4 and A^ = 8; 
with A^ = 8 we display only 4 modes even though 8 are calculated, in the interest of 
clarity. To make the calculations with different values of A^ comparable, we pick the 
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Figure 7.4: Time evolution of the real part of the 4 lowest Fourier modes for n = 32. The 
circles represent average values over an ensemble of 10'^ states; the dashed lines result from 
the Galerkin approximation; the solid lines result from our optimal prediction scheme (|7.1C1| ). 
Results are presented for (a) iV = 4 and (b) = 8. 



initial values of the collective variables in the way suggested by the analysis of Hald 



1^ : We sample an initial function u from the invariant measure, and then calculate 



values of the collective variables by performing the summations QaiUi- The graph shows 
that in each case the simple Galerkin calculation deviates immediately from the true 
solution, while the optimal prediction remains accurate. The calculations show that 
optimal prediction improves the accuracy compared to a Galerkin calculation; the time 
during which the optimal prediction remains accurate increases with increasing A^; we 
know from ( [7. 101 ) that the cost of optimal prediction in this problem is small. 



8 Conclusions 

We have exhibited the value of the statistical information used in optimal prediction 
for the solution of an underresolved nonlinear problem, and we have shown that per- 
turbation theory provides a ready-made machinery for applying the ideas of optimal 
prediction to problems where the invariant measure is non-Gaussian. 

The first-order implementation of optimal prediction with a fixed, small, number 
of conditions breaks down after a finite time; the time for which it is valid increases 
as the temperature increases; the temperature determines the variance of the invariant 
measure and thus the uncertainty in the system. There are two ways to improve the 
prediction: Go to more sophisticated prediction theory, as outlined in section 2, or 
increase the number of collective variables. We have demonstrated the power of the 
second alternative; the first alternative will be explored in later publications. 

Many aspects of the algorithms presented here require further work. The closure by 
means of a fixed number of affine conditions is only a first step; other closure schemes 
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will be investigated. In particular, optimal prediction fits within the framework of 
irreversible statistical mechanics, whose apparatus can be brought into use for finding 
closure schemes. More powerful versions of perturbation theory can be readily used. 
A careful perusal of our final example shows that there is a great advantage in using 
orthogonal kernels; in the interest of pedagogy, we have not used this possibility in the 
discussion of perturbation theory, and we have yet to explore orthogonal bases other 
than Fourier bases. 

An inspection of the formulas derived by perturbation theory shows that though we 
assumed a knowledge of an invariant measure, all that is finally used is a set of moments; 
this is the opening for applying optimal prediction methods in problems where less than 
a full invariant measure is known. 

We assume that all these issues will be handled as we progress to more comphcated 
problems, in more dimensions, with dissipation (requiring a careful modeling of effective 
Hamiltonians), and with general boundary conditions. We shall explore problems with 
these additional features in future publications. 
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